Intro

My name is Michael Vigdorchik (314980574), a computer science student interested in the abundant world of data science and deep learning.

This semester I decided to participate in the data science workshop course and seize the opportunity to learn this topic through a hands-on approach. During the course I planned to build my final project on the topic of medicine. While searching Kaggle, I encountered many datasets covering medical conditions, but out of all that I have seen, I chose the Diabetic Retinopathy Dataset for my project.

Diabetic Retinopathy is an eye disease that can lead to blindness, and is most common amongst people with diabetes. It is a condition that occurs with changes in the blood vessels of the retina. There are two stages of diabetic retinopathy: non-proliferative and proliferative.

Anyone with diabetes is at risk for diabetic retinopathy. The longer you have diabetes, the more likely you are to develop diabetic retinopathy.
Your risk rises if you have diabetes and you also smoke, have high blood pressure, or are pregnant.

More information on the condition can be found here: https://www.hopkinsmedicine.org/health/conditions-and-diseases/diabetes/diabetic-retinopathy

The Problem and Motivation

While manually examining this dataset of images, I realized that I would want to construct a neural network that attempts to diagnose the severity of the condition according to some essential features of the disease that may be present in the image of the eyeball, and then classify the condition into the most likely severity level.
The severity is categorized into 5 labels:

  1. Healthy (Not DR)
  2. Mild DR
  3. Moderate DR
  4. Proliferative DR
  5. Severe DR

(DR: Diabetic Retinopathy)

After reading further about the condition and inspecting the images (as a computer science student without any medical knowledge, obviously), I concluded that it would be a challenging task to diagnose a mild or even a moderate phase of the condition.

According to Wikipedia: Diabetic retinopathy affects up to 80 percent of those who have had diabetes for 20 years or more.
At least 90% of new cases could be reduced with proper treatment and monitoring of the eyes ... Each year in the United States, diabetic retinopathy accounts for 12% of all new cases of blindness.
It is also the leading cause of blindness in people aged 20 to 64.

This gave me more motivation to fulfill my project on this topic and attempt to lay the foundation of a Convolutional Neural Network model to function as a diagnosis tool.

The dataset consists of color images of left/right eye pairs. Each pair of images represents a case of DR (one of the 5 cases mentioned above).

Structure of the Dataset

The structure of the dataset is quite simple.

Link to dataset: https://www.kaggle.com/datasets/tanlikesmath/diabetic-retinopathy-resized

The project will run within the Google Colab environment with preinstalled packages/libraries such as Keras, OpenCV, Sklearn ...

Download the Dataset from Kaggle to Google Colab

EDA

Load the csv file with the DR condition severity scores into a pandas DataFrame

Get rid of the first 2 columns. They contain unnecessary information.
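As a sketch of this step (the in-memory CSV here, including its two leading index columns and column names, is an assumption about the Kaggle labels file, not its exact contents):

```python
import io
import pandas as pd

# Hypothetical stand-in for the dataset's labels CSV.
csv_text = """Unnamed: 0,Unnamed: 0.1,image,level
0,0,10_left,0
1,1,10_right,0
2,2,13_left,2
"""

df = pd.read_csv(io.StringIO(csv_text))

# Drop the first two columns -- they carry only redundant index information.
df = df.iloc[:, 2:]
print(df.columns.tolist())  # ['image', 'level']
```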

It seems that we possess many cases of healthy eye images, but we don't possess a lot of unhealthy specimens.

Let's take a look at the distributions of the images' height and width dimensions to check whether they are consistent across the dataset.
Create a simple iterator for the dataset, in order to iterate over all the images in the directory.

Let's add two additional columns, width and height, to our dataframe for each opened image

Now let's plot the distributions of width and height

We can see that the width is consistent but the height isn't.
We'll have to resize the height to a fixed value (1024) prior to attempting any neural network procedures.
An additional concern of mine is that the images are quite large:
1024x1024 features/pixels may turn out to be computationally "difficult" to some extent for our future model.

Let's see how the resize function works on some images

We are not interested in stretching the image until it reaches the target size, as seen above, because the image aspect ratio is no longer kept.
Thus, we need to find a way to pad with black pixels.

Now that's the form of images we want to work with.
The aspect ratio is preserved and the additional padding was applied successfully.

Let us now attempt to answer the question: what makes an eye image be classified with a DR severity level greater than 0?
We need to evaluate some of the pathological signs caused by the disease.
I will manually pick three "good" images with the following levels: Healthy (0), Moderate (2), and Severe (4).
The difference between each pair of levels is equal on the level scale.
Let us try to distinguish between the three cases in order to get some ideas about which features correspond to each case, and hopefully form some expectations later on.

What are we looking for?

Diabetic retinopathy.jpg

Picture source: http://dpeh.in/diabetic-retinopathy-treatment/

DR.jpg

Healthy Level 0 examples:

A healthy sample overall. Maybe there are some negligible signs of DR, but it may just be the poor quality of the picture. Our model will have to deal with these poor conditions later on.

Moderate Level 2 examples:

You may see some signs of DR:

Severe Level 4 examples:

We may observe clearly that the eye suffers from several signs of DR:

Some images are darker than the others, some are brighter, some are cropped, and some have poor detail resolution.
We need to be able to discover features that are not profoundly affected by the above disturbances.

Some Image Processing

Attempt 1: Improve brightness conditions with gray scale and contrast enhancement

This is an original level 4 image vs. the original gray-scaled.
You can see at the bottom of the image that the eye has some cotton wool spots or hard exudates. The gray scale made it a bit clearer.
Let's try to make it even more distinguishable.

An overall improvement. But the function is not optimized and takes way too long for such an improvement.
Also consider that we have 35K images to process. I also tried to tweak the α and β parameters but did not get any significantly better results.
Let's search the internet for a better alternative.

Attempt 2: addWeighted method of OpenCV

Use cv2.addWeighted(source1, alpha, source2, beta, gamma[, dst[, dtype]]).
It is a technique to blend two images together, where each image source has its corresponding weight.
The merged images are the source (original) image and its Gaussian blur transform.
Blurring an image means smoothing it, i.e., removing outlier pixels that may be noise in the image.
This technique will hopefully enhance some of the essential details for us.
The sigma parameter of GaussianBlur is the standard deviation of the kernel along the horizontal direction.

Ok, now those are way better results relative to the previous attempt. We can better observe the severity differences between different cases.
What's important is that the significant features of DR are enhanced.

Let's check out how it looks in gray scale:

Let's see how a larger sigma affects the images. Let's change sigma from 10 to 30:

Some details are more emphasized, but on the other hand, some images suffer from a "white ring" near the edges of the circumference.
This means that there is a trade-off here when increasing the sigma parameter.

Attempt 3: cv2.detailEnhance and cv2.edgePreservingFilter.

Another technique to check out

The results look satisfying in some images, but I think this technique is less effective than the addWeighted approach above.

Systematic Bias

We're still in the EDA phase.
The next step of the project will be to feed in the "resized, black padded and light enhanced" images
into some classification algorithm of Sklearn to begin with as a baseline.

To make sure we don't create what's called a systematic bias towards the black-padded images,
let us check first whether the images that need to be padded are mostly "healthy" images, "moderate", or another category.

Let's plot the above counters in such a way, that we can understand whether there is
a systematic bias towards fixed 1024 sized images or towards different sized images.

The number of images that differ from 1024 x 1024 (and need to be padded) outnumbers, by almost x2,
the number of images that already come in size 1024 x 1024 in each category class.
This means that, regardless of padding the images to a fixed size, we get a systematic bias towards black-padded images.

Classic classification algorithms do not "handle well" raw image input (including images after transformations such as
padding or changes in lighting conditions).

Therefore, we need to find a way to extract features so we can feed in these features instead of the raw (or transformed) images.

We will do this later, after we demonstrate how a "classic" classification algorithm fails with raw images.

Resize all images to fixed dimensions: 1024 x 1024

We may now drop the width, height columns from our dataframe. We won't use them.

Demonstration of classic classification algorithm on raw transformed images

We'll be using one of sklearn's algorithms

The whole resized dataset cannot fit into memory,
so we'll have to program a batch generator that will provide batches
of our dataset: images and labels. Each generated batch must align with the requirements of sklearn modules.

The following class will be useful for us when implementing training with the batch generator.
It provides a convenient way to access an image and its label.
Notice that we apply a transformation to the image upon accessing it.

Integer divisors of 35108 are: 2, 4, 67, 131, 134, 262, 268, 524, 8777, 17554
We'll choose the batch size from one of these values.
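The batch-training idea can be sketched as follows, with small synthetic arrays standing in for the flattened image batches (MultinomialNB supports incremental learning via partial_fit):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def batch_generator(X, y, batch_size):
    """Yield (samples, labels) batches so the whole dataset never has to
    sit in memory at once (here X is synthetic; the project reads images
    from disk inside the generator)."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(40, 16))  # stand-in for flattened images
y = rng.integers(0, 5, size=40)          # severity labels 0..4

clf = MultinomialNB()
classes = np.arange(5)
for Xb, yb in batch_generator(X, y, batch_size=8):
    # partial_fit must be told the full label set up front.
    clf.partial_fit(Xb, yb, classes=classes)

print(clf.predict(X[:3]))
```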

Some tools for batch processing in sklearn

Define our resized images dataset for sklearn:

Perform a split to get training indices and validation indices:

Prepare the train_names for the train function, and validation_names for accuracy estimation later on.

Let's train the classifier that we have selected

I chose the Multinomial Naive Bayes classifier according to the "Choosing the right estimator" guideline on the sklearn website:
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

We'll be using 3 tools to evaluate our models:

Attempt to predict on validation set

We can see from the report above that out of roughly 7700 healthy samples, our model predicted correctly only about 1200.
The overall performance of the model is not good at all.

Kappa or Cohen’s Kappa is like classification accuracy, except that it is normalized at the baseline of random chance on your dataset: It basically tells you how much better your classifier is performing over the performance of a classifier that simply guesses at random according to the frequency of each class.
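Computing the metric with sklearn might look like this (the labels here are toy values for illustration only):

```python
from sklearn.metrics import cohen_kappa_score

y_true = [0, 0, 1, 2, 3, 4, 4, 2]
y_pred = [0, 1, 1, 2, 4, 4, 3, 0]

# weights="quadratic" penalizes predictions far from the true severity
# level more heavily than near misses -- fitting for an ordinal scale.
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(round(qwk, 4))
```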

A Quadratic Weighted Kappa test score of 0.0233 indicates that our Naive Bayes model performs barely better than random guessing.

The precision score of 0.213 does not look promising either.

This was a demonstration of a classical algorithm failing to categorize
when given transformed images with black-padding bias.

Feature Extraction with HOG

As we mentioned earlier, before demonstrating the failure above, we want to be able to rely on features extracted from the images
independently of each image's dimensions and padding situation.

In this EDA phase, we will try to achieve that with the "Histogram of Oriented Gradients" (HOG) technique.

In the HOG feature descriptor, the distributions (histograms) of gradient directions
(oriented gradients) are used as features.
Gradients (x and y derivatives) of an image are useful because the magnitude
of gradients is large around edges and corners (regions of abrupt intensity changes),
and we know that edges and corners pack in a lot more information about object shape than flat regions.

HOG descriptors may be used for object recognition by providing them as features to a machine learning algorithm.
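Extracting HOG features with scikit-image could be sketched like this; the patch is synthetic and the parameter values are common defaults, not the tuned ones from the notebook:

```python
import numpy as np
from skimage.feature import hog

# Synthetic grayscale patch standing in for a fundus image crop.
patch = np.random.rand(128, 64)

features, hog_image = hog(
    patch,
    orientations=9,          # number of gradient-direction bins
    pixels_per_cell=(8, 8),  # histogram computed per 8x8 cell
    cells_per_block=(2, 2),  # blocks of cells are contrast-normalized
    visualize=True,          # also return a viewable HOG image
)
print(features.shape)  # (3780,) for this patch and these parameters
```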

I used these articles:
https://www.thepythoncode.com/article/hog-feature-extraction-in-python

https://www.techgeekbuzz.com/blog/how-to-apply-hog-feature-extraction-in-python/

Let's focus the HOG on a small patch of the image where we can clearly notice a "spot" (white spot).
Also check if the contrast enhancement does any difference:

Explanation of the parameters of the HOG function:

Every specified patch pixel will be represented by a "star", which is a combination of gradient magnitude and angle.

These parameters yielded the "best" HOG image, in which the main spot is clearly outlined by the surrounding stars, and the hard exudates at the bottom-right corner are also visible.

We got this after some tweaking with the sigmaX parameter of the GaussianBlur function and hog parameters.

The feature values are larger in the right plot than in the left.

Now let's observe the HOG of the entire image

Define the patch size. Resize to 1:2 aspect ratio

The "circles" are now more noticeable

We get HOG features vector of size:

For convenience we want to be able to execute sklearn's SVC algorithm on the HOG features.
Since sklearn does not support the partial_fit method for SVC, we cannot perform batch processing as we did before,
and we cannot input the entire train dataset with features of size 288036, because we would run out of available memory and it would take way too long.
Hence, we'll attempt to customize the HOG function parameters in such a way that
we get "reasonable" features at the price of a reduced feature vector size.
The goal is to reduce the feature size to an order of magnitude in the thousands.

After a lot of experimentation, this is the achieved result:

Crop to get rid of the background (to avoid any bias)

We can infer some DR-related features from this HOG image, fewer than from the previous one, but in return we get a feature vector of size 7200.
Here is a plot of the raw features vector:

Display some HOG features of Level 0 DR

Display some HOG features of Level 3 DR

End of EDA

After exploring and understanding the data we are working with, it is time to build a model that will be able to categorize each image to its correct label.
First we'll try out some of Sklearn's classic machine learning algorithms.
These algorithms will get the HOG features vector as input instead of the raw images.
Later on we will attempt to build a Convolutional Neural Network to do the same.
This CNN will be based on a known pre-built (maybe even pre-trained) model.
We will only add a few external layers to it in order to adjust it to our categorization needs.
This method of model building is called "Transfer Learning".

The common goal for all the methods above is to learn the profound features that represent each category.

SVC classifier on HOG feature vectors

Define our extractor class that is responsible for loading an image and calculating its HOG vector.

Prepare the dataset for sklearn's model

Extract DR features from all images.
The expected number of features per sample is 7200

Split the HOG features matrix to train/test groups:

SVC model with RBF kernel on the HOG features vectors

Train the model

Return the mean accuracy on the given test data and labels.

Display the confusion matrix

It seems that the model just "guesses" with a probability of 75% whether a given healthy sample is indeed healthy.
It also confuses healthy with moderate.
Still, that's progress compared to the previous attempt with the MultinomialNB model on the raw images.

The report above reflects the RBF kernel model. It does a somewhat decent job in predicting healthy samples with 0.75 accuracy.

Can SGD Classifier provide better results ?

0.63 score which is slightly better than SVC-RBF's 0.5

It does a quite decent job guessing the healthy samples, but overall a poor job guessing the other categories.

Let's try the previous Multinomial Naive Bayes model but on the HOG features vectors

The matrix does not seem very promising even though the score we got is 0.69. It predicted most of the samples as healthy.

So far SGD classifier provided the best model.
Now let us try a different approach to the problem. Convolutional Neural Networks.

CNN with Keras

Let's try a more modern approach, using CNN-based techniques with Keras and TensorFlow.

Shuffle the original images-labels DataFrame:

Data Balancing

An imbalanced dataset is a dataset where each category class is represented by a different number of input samples.
We decrease the number of samples of the biggest class down to the size of the smallest class
(since we cannot generate synthetic data to increase the number of severe/proliferative cases).

In the following code we'll show how to adjust the number of samples in each category according to our desire.
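The downsampling step can be sketched with pandas; the toy frame below stands in for the real ~35K-row labels DataFrame:

```python
import pandas as pd

# Toy labels frame; the real one has an image column and a 'level' column.
df = pd.DataFrame({
    "image": [f"img_{i}" for i in range(12)],
    "level": [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 4],
})

# Downsample every class to the size of the smallest one.
n_min = df["level"].value_counts().min()
balanced = df.groupby("level").sample(n=n_min, random_state=42)
print(balanced["level"].value_counts().to_dict())  # every class has n_min rows
```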

Because the overall balanced dataset is almost 6 times smaller than the original due to the above, we'll keep it for later use.
Our first CNN attempt will be on the original dataset.

Split the original DataFrame into two DataFrames: Train and Validation:

Preprocess the images and save to my google drive for future use

Download the preprocessed images from google drive (for new colab sessions)

Attempt 1 - ResNet50

Let us start with a relatively lightweight convolutional neural network called ResNet50.
This architecture can be used for computer vision tasks such as image classification, object localisation, and object detection.
We'll make our first attempt with this architecture as the base network.

Create our own customized image tensor generators for training and validation.
It's good practice to use a validation split when developing the model. 80% of the images for training and 20% for validation.

Some model parameters:

Check that the dimensions that the generator yields are correct

Now let's construct our model around the prepared ResNet50 model.
This model should be light enough to not crash our session with Colab's GPU.
According to ResNet50 documentation, we need to perform some additional preprocessing on the input:
https://www.tensorflow.org/api_docs/python/tf/keras/applications/resnet50/ResNet50
Thus, we add the tf.keras.applications.resnet50.preprocess_input layer to the model:
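A sketch of such a ResNet50-based classifier. weights=None keeps the sketch offline-friendly, whereas the project would use weights="imagenet" for transfer learning; here preprocess_input is applied to the batch directly rather than as a layer inside the model, and the input size is illustrative:

```python
import numpy as np
import tensorflow as tf

NUM_CLASSES = 5
IMG_SIZE = 224  # illustrative; the notebook feeds larger images

# Base network without its ImageNet classification head.
base = tf.keras.applications.ResNet50(
    include_top=False, weights=None, pooling="avg",
    input_shape=(IMG_SIZE, IMG_SIZE, 3))

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

# ResNet50 expects its own input preprocessing (BGR conversion +
# mean subtraction), applied here to the raw batch.
batch = np.random.randint(0, 256, (2, IMG_SIZE, IMG_SIZE, 3)).astype("float32")
batch = tf.keras.applications.resnet50.preprocess_input(batch)
preds = model.predict(batch)
print(preds.shape)  # (2, 5)
```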

While training, we might reach a "learning plateau". Since we don't want to overfit on the training data,
we'll use Early Stopping callback:

This callback will stop the training when there is no improvement in the validation loss for 4 consecutive epochs.
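Such a callback might be configured like this (restore_best_weights is an extra assumption on my part, not stated above):

```python
import tensorflow as tf

# Stop when validation loss fails to improve for 4 consecutive epochs,
# and roll back to the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=4,
    restore_best_weights=True,
)
print(early_stop.patience)  # 4
```

The callback would then be passed to `model.fit(..., callbacks=[early_stop])`.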

We'll use keras's default learning rate, which is 0.001

Plot the training and validation data

We can see that the model predicted all the samples to be healthy. That is a terrible model. It did not catch any meaningful features.

Attempt 2 - EfficientNetB3

Create a custom model based on an existing one: EfficientNetB3.
EfficientNet has proven to be a very good baseline for many computer vision tasks.
Further information on all sorts of architectures can be found here:
https://keras.io/api/applications/

Here is a graph that depicts all kinds of architectures for image tasks:

nets.png

https://www.tensorflow.org/api_docs/python/tf/keras/applications/efficientnet/EfficientNetB3
Note: each Keras Application expects a specific kind of input preprocessing.
For EfficientNet, input preprocessing is included as part of the model (as a Rescaling layer), and thus tf.keras.applications.efficientnet.preprocess_input is actually a pass-through function.
EfficientNet models expect their inputs to be float tensors of pixels with values in the [0-255] range.

Thus, we provide the norm=False argument to the class constructor:

The overall idea for the architecture of the additional layers was taken from: https://keras.io/examples/vision/image_classification_efficientnet_fine_tuning/

A technique to reduce overfitting is to introduce dropout regularization to the network.

When you apply dropout to a layer, it randomly drops out (by setting the activation to zero) a number of output units from the layer during the training process.
Dropout takes a fractional number as its input value, in the form such as 0.1, 0.2, 0.4, etc.
This means dropping out 10%, 20% or 40% of the output units randomly from the applied layer.
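The behavior described above can be sketched directly with a Dropout layer (the 0.4 rate here is illustrative):

```python
import numpy as np
import tensorflow as tf

# Dropout zeroes a random fraction of activations during training only;
# surviving units are scaled by 1/(1-rate) to keep the expected sum stable.
layer = tf.keras.layers.Dropout(0.4)

x = np.ones((1, 10), dtype="float32")
train_out = layer(x, training=True)   # some units randomly zeroed
infer_out = layer(x, training=False)  # identity at inference time

print(np.array_equal(infer_out.numpy(), x))  # True
```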

Let's create a new neural network with tf.keras.layers.Dropout before training it using the augmented images:

From the decrease in the training/validation accuracy we can infer that the model overfits.
The model overfits "towards" the healthy samples, because the dataset is imbalanced and the healthy samples are the majority.

If we focus on the F1 score of each category in the report we may see that the model performs very well on the healthy samples, and quite decently on the moderate and severe samples.

Visualizing SHAP values output

This tool helps us to understand the predictions of the model based on the features that it focuses on.

Pickup some images from 3 main categories: healthy, moderate, severe

Perform SHAP evaluations on these images

The 5 graphs with the blue/red squares are the top 5 guesses of the model from left to right.
Each guess is "explained" here by the SHAP values. Each square indicates the "strength" of the feature in that area.
The stronger the feature, the redder it is. The weaker the feature, the less effect it has on the probability of predicting a specific category.

The SHAP values show us that the main features of the healthy sample here are the peripheral veins.

Here, the SHAP values emphasize the damaged blood vessels and an area that is developing hard exudates.

Here, there is strong focus on a specific area that contains a unique feature of the severe and proliferative categories: noticeable hard exudates and white cotton wool.

Attempt 3 - EfficientNetB3 - Fine Tuning

This time with a balanced train/validation set.
This model actually takes input images of shape (300, 300, 3), and the input data should be in the range [0, 255],
unlike the previous attempt where we provided input of shape (512, 512, 3).
Normalization is included as part of the model.
https://keras.io/examples/vision/image_classification_efficientnet_fine_tuning/
We'll use the balanced dataset that we created earlier:

Build and run our custom model while base model's weights are not trainable

Do another round of training from where we finished, since it seems that 20 epochs were not enough and the model may still improve

Graph of metrics of both fit calls:

Do a round of fine-tuning of the entire model.
Let's unfreeze the base model and train the entire model end-to-end with a low learning rate.
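The freeze-then-unfreeze pattern can be sketched with a small stand-in base model (the notebook's actual base is EfficientNetB3, and the 1e-5 learning rate is illustrative):

```python
import numpy as np
import tensorflow as tf

# Small stand-in base model; the notebook uses EfficientNetB3 here.
base = tf.keras.Sequential(
    [tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(8, activation="relu")],
    name="base")
base.trainable = False  # phase 1: train only the classification head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5, activation="softmax"),
])
_ = model(np.zeros((1, 4), dtype="float32"))  # build the model

# Phase 2: unfreeze the base and recompile with a much lower learning
# rate, so the pretrained weights are only gently adjusted end-to-end.
base.trainable = True
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
    loss="categorical_crossentropy",
)
print(len(model.trainable_weights))  # 4: base and head kernels + biases
```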

Validation data predictions:

Now that's an interesting result we got there.
It seems that the model managed to capture some essential features of each category.

Let's observe SHAP values of this model to get explanation regarding the features that the model has learned:

Although this sample was categorized correctly by the model, it is still not entirely clear whether it is a healthy sample. If you look closely you can see development of hard exudates around the upper blood vessels, but this may be a false impression caused by the poor image quality.
This is a good example that shows how much effect the image quality actually has.

This sample was not categorized correctly according to the dataset: instead of moderate (which came in second place), it got proliferative. It is worth mentioning that, looking at its condition, this case just might really be proliferative.
The model detected many features of the proliferative case.

Here the model predicted severe correctly according to the profound development of white cotton wool spots, hard exudates and some abnormal and damaged veins.

Attempt 4 - EfficientNetB3 - Rebalance and Allow Trainable Base Model

We cannot perform a center crop of the images down to 300x300 because we would lose a lot of information
at the edges of the eye, and this information may be critical.
Therefore we will resize the images to 300x300 as before, but this time we'll use a differently balanced dataset.
The new balance of the dataset is as follows:

Unlike the previous dataset, this one includes many more moderate, mild, and healthy samples,
but not too many, so as not to cause too much bias.

Reason why not to center crop the image

To feed the network with images of shape 300x300, instead of resizing from 512x512, we could perform a center crop to get the central fraction of the image without the black edges. This may be good for us because it gets rid of the bias caused by the padding (discussed above).
Let us see how this center crop affects the images:
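A center crop of the central fraction might be sketched like this (the 0.75 fraction is illustrative):

```python
import numpy as np

def center_crop(img, frac=0.75):
    """Keep the central `frac` of the image in each dimension,
    discarding the border (and any black padding) around it."""
    h, w = img.shape[:2]
    ch, cw = int(h * frac), int(w * frac)
    top = (h - ch) // 2
    left = (w - cw) // 2
    return img[top:top + ch, left:left + cw]

img = np.zeros((512, 512, 3), dtype=np.uint8)
print(center_crop(img).shape)  # (384, 384, 3)
```

The cropped result would still need a resize down to 300x300 before being fed to the network.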

The information on the edges disappears, and that is not good enough for us, because there may be critical features of a sample lying somewhere near the edges that profoundly affect the case severity.

Maybe EfficientNetB6, which has an input resolution of 528x528, would be a good attempt later?

Define new balanced dataset, build and run our custom model while base model's weights are trainable

Define the new balanced dataset:

The validation loss kept rising, so early stopping was activated.

The validation loss increases due to overfitting towards the healthy samples over time (epochs).
Now let's see the confusion matrix on the validation set

It is quite similar to the previous attempt, where we first trained.

Now we shall try to perform predictions on a new test set.
This test set is equally balanced (708 samples for each class) and the samples are randomly
chosen from the entire samples dataset (the original dataset).

Looks marvelous!
The model successfully predicted each category with minor error on our given test set.
We shall perform a prediction on the entire 35108-sample dataset and draw its confusion matrix and report.

The F1 scores show us pretty good stats: above 80% prediction success rate for
the Proliferative, Severe, and Healthy samples.
Notice that the model does a very good job distinguishing between Proliferative and Severe cases.
Only 23 Proliferative samples were diagnosed as Severe and only 15 Severe samples were diagnosed as Proliferative.
But! Even though the model did not predict impressively for the Mild and Moderate samples, it is worth noticing that it predicted about 7000 healthy samples as Mild or Moderate (first row).
This model prefers (most of the time) to predict a false positive DR diagnosis for Healthy samples rather than
a false negative for actual Mild/Moderate cases.
The low F1 score for Mild samples can be explained by the poor quality of the original images. Some images come in poor quality in the first place, so our image processing technique wouldn't be able to enhance any slightly profound features.
As the saying goes: "It's not the dancer, it's the curved floor." 😏
Even the diagnosis of the doctors may be imprecise in some samples. It's a long and complicated job to figure out which samples are inconsistent with the correct diagnosis.
The bottom line is that we possess a CNN model that is capable of categorizing DR samples with "good" precision.

SHAP Explanation

The above SHAP values mainly focus on the healthy tissue near the fovea (the central circle) and on the healthy blood vessels.

The SHAP values focus on the abnormal blood vessels and aneurysms.

The above SHAP values show us the profound features of severe and proliferative DR such as: "Cotton Wool" spots and abnormal blood vessels.

Conclusions

Results

Classic Sklearn algorithms' performance on HOG vectors

|  | SVC RBF Kernel | SGD | MultinomialNB |
| --- | --- | --- | --- |
| Macro Average F1-score | 0.22 | 0.2 | 0.19 |

CNN performance on images

|  | ResNet50 - attempt 1 | EfficientNetB3 - attempt 2 | EfficientNetB3 - attempt 3 | EfficientNetB3 - attempt 4 |
| --- | --- | --- | --- | --- |
| Macro Average F1-score | 0.17 | 0.58 | 0.46 | 0.72 |

Improvements and Ideas

Due to time constraints I did not manage to implement and check these ideas.

Self Reflection

I knew that tackling the Diabetic Retinopathy problem was not an easy task whatsoever.
Just by looking at the data and examining all the categories, I could barely notice the subtle differences between neighboring DR severity levels: 2 and 3 looked very similar, and 3 and 4 looked the same.
Therefore it sparked genuine interest in me as to whether a machine learning or deep learning model would be able to distinguish between the different features.
Along the path to reaching the good model from above (the last attempt with EfficientNetB3) I encountered many difficulties and challenges.
This course is actually an intro for me to the vast world of data science, machine learning and deep learning.
Most of the concepts that are presented throughout this notebook were completely new for me, and I learned them along the course.
I read major parts of the book "Deep Learning" by Ian Goodfellow et al., among others, and also watched Prof. Alex Bronstein's YouTube series "Deep Learning on Computational Accelerators".
These materials gave me solid ground knowledge to begin tackling the problem with sufficient understanding of what I'm doing and what I want to achieve.
The course provided me a window into actual real-world tools with good explanations, so I am grateful for that.
Using Google's Colab environment was very helpful. I enjoyed it. The environment is very intuitive and integrates well with Google Drive.
I'm actually pretty impressed with myself that I managed to pull this off.
While enrolling in this course I was aware of the knowledge gap, and decided to take the challenge upon myself, because I believe in my learning abilities.
I think it turned out well for me and I can say that this course was one of the most instructive courses I had during my studies.